PASTASpark: multiple sequence alignment meets Big Data

Authors

  • José Manuel Abuín
  • Tomás F. Pena
  • Juan Carlos Pichel
Abstract

Motivation: One basic step in many bioinformatics analyses is multiple sequence alignment. One of the state-of-the-art tools for performing multiple sequence alignment is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading, but it is limited to processing datasets on shared-memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA, which is the most time-consuming task.

Results: Speedups of up to 10× with respect to single-threaded PASTA were observed, which allows an ultra-large dataset of 200 000 sequences to be processed within the 24-h limit.

Availability and implementation: PASTASpark is an Open Source tool available at https://github.com/citiususc/pastaspark.

Contact: [email protected].

Supplementary information: Supplementary data are available at Bioinformatics online.
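The abstract's central idea, moving the alignment phase from one shared-memory machine to many workers, can be illustrated with a minimal sketch. It assumes the alignment phase breaks down into independent per-subset alignment tasks. A standard-library thread pool stands in for Apache Spark, and `toy_align`, `split_into_subsets` and `align_subsets_in_parallel` are hypothetical names introduced only for illustration. The "aligner" merely pads sequences with gaps, so this is not PASTASpark's implementation and not a real alignment method.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_align(subset):
    """Placeholder aligner: right-pads sequences with gaps to equal length.

    NOT a real alignment method; it only stands in for the external
    aligner that would run on each subset.
    """
    width = max(len(seq) for seq in subset)
    return [seq.ljust(width, "-") for seq in subset]


def split_into_subsets(sequences, subset_size):
    """Cut the input into consecutive chunks of at most subset_size sequences."""
    return [sequences[i:i + subset_size]
            for i in range(0, len(sequences), subset_size)]


def align_subsets_in_parallel(sequences, subset_size=2, workers=4):
    """Align every subset independently of the others.

    Assumption: a cluster engine would map over the subsets in the same
    way, with tasks spread over several machines instead of local threads.
    """
    subsets = split_into_subsets(sequences, subset_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input subsets.
        return list(pool.map(toy_align, subsets))


if __name__ == "__main__":
    seqs = ["ACGT", "AC", "GATTACA", "GAT", "TTG"]
    for block in align_subsets_in_parallel(seqs):
        print(block)
```

Because each subset is processed without reference to the others, the same map step could in principle be handed to any distributed engine. The subsequent merging of sub-alignments is outside the scope of this sketch.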

Similar resources

An Application of the ABS LX Algorithm to Multiple Sequence Alignment

We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...

An Overview of Multiple Sequence Alignment Parallel Tools

Multiple sequence alignment is a key problem to most bioinformatics applications. The last ten years have witnessed a big improvement to existing multiple alignment tools and the development of new ones. Various parallel architectures have been experimented for reaching the highest level of accuracy and speed. This paper surveys most popular tools to clarify how parallelism accelerates the proc...

Unraveling the Complexities of Life Sciences Data.

The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab ( kolkerlab.org ) is creatin...

Smart machines and the SP theory of intelligence

These notes describe how the SP theory of intelligence, and its embodiment in the SP machine, may help to realise cognitive computing, as described in the book Smart Machines. In the SP system, information compression and a concept of multiple alignment are centre stage. The system is designed to integrate such things as unsupervised learning, pattern recognition, probabilistic reasoning, and m...

ThemisMR: An I/O-Efficient MapReduce

“Big Data” computing increasingly utilizes the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, and so minimizing the number of I/O operations is critical to improving their performance. In this work, we present ThemisMR, a MapReduce implementation that reads and writes data records to disk exactly twice, which is the minimum amou...

Journal title:
  • Bioinformatics

Volume 33, Issue 18

Pages -

Publication date: 2017